Overview¶

This notebook demonstrates how to use Monet to cluster scRNA-Seq data with the Galapagos method (Wagner, 2019). The method performs density-based clustering (using DBSCAN) directly on the t-SNE result. It is a straightforward approach based on the notion that t-SNE plots provide a good starting point for defining cell populations. The approach is limited in its ability to resolve closely related cell types that don't separate well in t-SNE plots, but it is simple and transparent, and it tends to avoid overclustering.

Setting up the notebook¶

In [1]:

# change notebook width and font
from IPython.display import HTML, display
display(HTML("""<style>
    /* source: http://stackoverflow.com/a/24207353 */
    .container { width:95% !important; }
    div.prompt, div.CodeMirror pre, div.output_area pre { font-family:'Hack', monospace; font-size: 10.5pt; }
    </style>"""))

from monet import util
_LOGGER = util.configure_logger()

# the following is to allow embedding of plotly figures
from plotly.offline import init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected=True)

Step 1: Perform t-SNE¶

Here, we perform t-SNE as in the previous tutorial.

In [2]:

import gc

from monet import ExpMatrix
from monet import visualize

expression_file = 'data/v3_human_pbmc_10k_expression.npz'

matrix = ExpMatrix.load_npz(expression_file)

fig, tsne_scores = visualize.tsne_plot(matrix, title='PBMC data')
# by default, tsne_plot() performs PCA with 30 principal components
# this can be changed, e.g. to 50, using tsne_plot(..., num_components=50)

fig.show()

# free up memory
del matrix
gc.collect()

[2020-06-17 14:30:28] (monet.core.exp_matrix) INFO: Loaded expression matrix with 10681 cells and 16319 genes -- .npz format, 36.7 MB (hash: f9d7fac20f4de6184ff55388c267699a).
[2020-06-17 14:30:28] (root) INFO: No Monet model provided, performing PCA to determine first 30principal components...
[2020-06-17 14:30:28] (monet.latent.pca_model) INFO: Converted matrix to float32 data type.
[2020-06-17 14:30:34] (monet.latent.pca_model) INFO: The PCA took 1.5 s.
[2020-06-17 14:30:34] (monet.latent.pca_model) INFO: The fraction of variance explained by the 30 selected PCs is 33.4 %.
[2020-06-17 14:30:34] (root) INFO: Performing t-SNE...
[2020-06-17 14:31:10] (root) INFO: t-SNE took 35.6 s.

Out[2]:

Step 2: Clustering with DBSCAN¶

We'll now apply DBSCAN (Ester et al., 1996), a density-based clustering algorithm, to the t-SNE result. DBSCAN has two parameters, called Eps and MinPts (the latter is called min_samples in scikit-learn). Eps defines the radius within which neighbors are searched, and MinPts defines the minimum number of points (here: cells) that must fall within that radius for a cluster to be formed. Cells that are not assigned to any cluster are considered "outliers". You can read more about the DBSCAN algorithm in the scikit-learn User Guide, and visit Naftali Harris' website to look at some nice demonstrations of DBSCAN on various datasets.
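
As a minimal illustration of the two parameters outside of Monet, the following sketch runs scikit-learn's DBSCAN on synthetic 2D points that stand in for t-SNE coordinates. The data and the parameter values are made up for illustration and are not taken from the PBMC dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two dense blobs plus one isolated point, standing in for t-SNE coordinates.
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=0.3, size=(100, 2))
stray = np.array([[5.0, -20.0]])
points = np.vstack([blob_a, blob_b, stray])

# eps is the neighborhood radius (Eps); min_samples corresponds to MinPts.
labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(points)

# scikit-learn marks outliers (noise points) with the label -1.
n_clusters = len(set(labels.tolist()) - {-1})
n_outliers = int(np.sum(labels == -1))
print(n_clusters, n_outliers)
```

The isolated point has no neighbors within the radius, so it is reported as an outlier rather than being forced into one of the two clusters.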

In Monet, you specify Eps as a fraction of the diameter of the t-SNE plot, using the eps_frac parameter. Here, I use the term "diameter" to refer to the distance from the top-left corner to the bottom-right corner of the t-SNE plot. For example, setting eps_frac=0.03 (the default) means that Eps will be set to 3% of the diameter. Furthermore, you specify MinPts as a fraction of the total number of cells available, using the min_cells_frac parameter. So setting min_cells_frac=0.01 (the default) means that MinPts will be set to 1% of the total number of cells (rounded up to the next integer).
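
The conversion from fractions to absolute DBSCAN parameters can be sketched as follows. This is not Monet's code: the helper name `dbscan_params_from_fractions` is hypothetical, and the sketch assumes that the coordinates form an array with one row per cell and that the "diameter" is the diagonal of the bounding box of the points.

```python
import math

import numpy as np


def dbscan_params_from_fractions(coords, eps_frac=0.03, min_cells_frac=0.01):
    """Hypothetical helper: turn the two fractions into Eps and MinPts.

    Assumes `coords` has shape (n_cells, 2) and that the plot diameter is
    the corner-to-corner distance of the bounding box of the points.
    """
    coords = np.asarray(coords, dtype=float)
    extent = coords.max(axis=0) - coords.min(axis=0)  # (width, height)
    diameter = float(np.hypot(extent[0], extent[1]))
    eps = eps_frac * diameter
    # MinPts is a fraction of the cell count, rounded up to an integer.
    min_pts = math.ceil(min_cells_frac * coords.shape[0])
    return eps, min_pts


# Toy coordinates: 10681 points spanning a 300 x 400 box (diagonal 500).
coords = np.zeros((10681, 2))
coords[-1] = (300.0, 400.0)
eps, min_pts = dbscan_params_from_fractions(coords)
print(eps, min_pts)
```

With 10681 cells and min_cells_frac=0.01, the rounding step gives ceil(106.81) = 107.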

In [3]:

from monet.visualize import plot_cells
from monet.cluster import cluster_cells_dbscan

eps_frac = 0.03
min_cells_frac = 0.01

cell_labels, clusters = cluster_cells_dbscan(
    tsne_scores, eps_frac=eps_frac, min_cells_frac=min_cells_frac)

cluster_colors = {
    'Outliers': 'lightgray',
}

fig = plot_cells(
    tsne_scores,
    cell_labels=cell_labels,
    cluster_order=clusters,
    cluster_colors=cluster_colors,
    width=850)

fig.show()

[2020-06-17 14:31:10] (monet.cluster.galapagos) INFO: Performing DBSCAN with minPts=107 and eps=6.57.
[2020-06-17 14:31:11] (monet.cluster.galapagos) INFO: Clustering with DBSCAN took 0.9 s.

These clusters seem a little bit too broad. By tweaking the DBSCAN parameters, we can increase the clustering resolution.
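
How the radius affects resolution can be seen on a toy example. The sketch below uses scikit-learn's DBSCAN directly (not Monet) on two synthetic, neighboring blobs; all values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# Two neighboring blobs whose centers are 4 units apart.
left = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2))
right = rng.normal(loc=(4.0, 0.0), scale=0.3, size=(100, 2))
points = np.vstack([left, right])


def count_clusters(points, eps, min_samples=10):
    """Run DBSCAN and count the clusters, ignoring outliers (label -1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    return len(set(labels.tolist()) - {-1})


# A radius larger than the gap between the blobs merges them into one cluster.
n_broad = count_clusters(points, eps=5.0)
# A radius smaller than the gap keeps the two blobs separate.
n_fine = count_clusters(points, eps=0.8)
print(n_broad, n_fine)
```

In the same way, lowering eps_frac lets DBSCAN separate populations that a larger radius would join.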

In [4]:

from monet.visualize import plot_cells
from monet.cluster import cluster_cells_dbscan

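# default values used in the previous cell:
# eps_frac = 0.03
# min_cells_frac = 0.01

# smaller values, chosen to increase the clustering resolution
eps_frac = 0.023
min_cells_frac = 0.007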

cell_labels, clusters = cluster_cells_dbscan(
    tsne_scores, eps_frac=eps_frac, min_cells_frac=min_cells_frac)

cluster_colors = {
    'Outliers': 'lightgray',
}

fig = plot_cells(
    tsne_scores,
    cell_labels=cell_labels,
    cluster_order=clusters,
    cluster_colors=cluster_colors,
    width=850)

fig.show()

[2020-06-17 14:31:11] (monet.cluster.galapagos) INFO: Performing DBSCAN with minPts=75 and eps=5.04.
[2020-06-17 14:31:12] (monet.cluster.galapagos) INFO: Clustering with DBSCAN took 1.0 s.
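
Besides looking at the plot, it can help to tabulate the cluster sizes and the fraction of outlier cells. The sketch below assumes that cell_labels behaves like a pandas Series of cluster names indexed by cell. The stand-in values, including the cell and cluster names, are hypothetical; only the 'Outliers' label appears in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the cell_labels returned by cluster_cells_dbscan().
cell_labels = pd.Series(
    ['Cluster 1', 'Cluster 1', 'Cluster 2', 'Outliers', 'Cluster 1'],
    index=['cell_a', 'cell_b', 'cell_c', 'cell_d', 'cell_e'])

# Number of cells per label, largest first.
cluster_sizes = cell_labels.value_counts()

# Fraction of cells that were not assigned to any cluster.
outlier_frac = (cell_labels == 'Outliers').mean()

print(cluster_sizes)
print(f'Outlier fraction: {outlier_frac:.1%}')
```

A large outlier fraction would suggest that the chosen parameters are too strict for the data.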

Now that we are happy with the clustering result, we can save it to disk.

In [5]:

from monet import util

util.save_cell_labels(cell_labels, 'output/v3_human_pbmc_10k_clustering.tsv')

[2020-06-17 14:31:13] (monet.util.files) INFO: Saved labels for 10681 cells to tab-delimited plain-text file.